The goal of this document is to study how the covariates in the Brain Stroke Dataset depend on each other. The covariates included in the data set are gender, age, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, smoking status, and stroke. Previous literature suggests that heart disease, high blood pressure, diabetes, cholesterol levels, smoking status, age, and sex are risk factors for stroke. We would like to study whether these risk factors, as well as the other covariates, predict stroke in this data set. In this examination, we identify the most likely risk factors in this dataset and reveal a number of interesting interactions between covariates. We also build a model purely for prediction accuracy, and test the reliability of the data with cross-validation.
A stroke is a medical condition in which poor blood flow in the brain causes brain cells to die. Strokes can be caused either by bleeding in the brain, which is classified as a hemorrhagic stroke, or by a lack of blood flow to the brain, which is classified as an ischemic stroke. Medical literature attributes strokes to many causes: high blood pressure, cholesterol levels, and cardiovascular diseases can increase the risk of a stroke, most often by causing blood clots that may dislodge and then block blood vessels. Other conditions, such as diabetes, smoking, aneurysms, inflammation, and comorbidities, may increase either the risk of having a stroke or its severity.
Our main goal will be to relate stroke to the other variables in order to assess them as risk factors.
The dataset records the following covariates:
gender: “Male”, “Female” or “Other”
age: age of the patient
hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
ever_married: “No” or “Yes”
work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
Residence_type: “Rural” or “Urban”
avg_glucose_level: average glucose level in blood
bmi: body mass index
smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”
stroke: 1 if the patient had a stroke or 0 if not
The presence of many categorical and continuous covariates poses a challenge, and we will need to be wary of confounders and paradoxes, such as the one we will soon find in the relationship between stroke, age, and BMI. We will investigate such interactions with models and plots.
The first step to understanding a (small enough) dataset is to view the covariates individually. With some quick summary statistics, we can get an idea of what we are looking at:
As expected, the value of stroke is either 0 or 1. Age, somewhat surprisingly, ranges all the way from 0.08 to 82. It is clear from the dataset that the low ages are not input errors, so we may assume that our study includes data about very young patients. BMI ranges from 14 to 48, reasonable values for a human dataset. We can also see that there is a large number of unknown smoking statuses (roughly one third of the patients), that both heart disease and hypertension have relatively low occurrence (less than 10%), that most patients were married, and that there are more female patients than male. Literature suggests that stroke is more likely in young male patients, but that the longer life expectancy of female subjects creates a survival bias that inflates the prevalence among older female patients. It remains to be seen whether this is observed in our dataset. Furthermore, the ages are roughly evenly distributed, with no especially strong tendency towards young or old.
It was not mentioned how average blood glucose levels were measured, but an A1c screening is the most likely method. Thus, we might use average blood glucose level as a stand-in for diabetes. We may keep it as a continuous covariate to preserve accuracy, or we may stratify it into normal (<117 mg/dL), prediabetic (117-137 mg/dL), and diabetic (>137 mg/dL) for ease of interpretation.
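The stratification described above can be sketched as a small helper. This is an illustration in Python rather than the report's own code; the function name `glucose_category` and the sample readings are hypothetical, and the treatment of the boundary values 117 and 137 as prediabetic is an assumption read off the ranges in the text.

```python
def glucose_category(avg_glucose_mg_dl: float) -> str:
    """Bin an average glucose level (mg/dL) using the cut points quoted in the text."""
    if avg_glucose_mg_dl < 117:
        return "normal"
    if avg_glucose_mg_dl <= 137:
        # Assumption: both endpoints of the 117-137 range count as prediabetic.
        return "prediabetic"
    return "diabetic"


# Hypothetical readings, not taken from the dataset.
for reading in (95.0, 120.5, 210.0):
    print(reading, glucose_category(reading))
```

Keeping the continuous value alongside the category lets a model use whichever form fits better.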
Now that we’ve gotten our bearings a little, we can use plots to ask the simplest questions. First, how does stroke risk vary with our continuous covariates?
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.20439617 0.336973944 -21.37968 2.065266e-101
## age 0.07446153 0.004945627 15.05603 3.151363e-51
Binomial GLM, logit link, Stroke vs Age
In the above plot (code can be found in the EDAvisual.R file), we plot the occurrences of stroke/no stroke against age. Overlaid as a red line is a GLM fitting the risk of stroke against age. The GLM estimates an increase of about 0.074 in the log-odds of stroke per additional year of age, which corresponds to multiplying the odds by roughly 1.08 per year. Furthermore, a comparison of the null and residual deviance suggests that the fit is reasonably appropriate, so we do not yet worry about zero inflation or other potential issues. First, we investigate the other bivariate cases to build our intuition.
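Because the slope of a logit-link GLM lives on the log-odds scale, it can help to convert it to an odds ratio and to fitted probabilities. The sketch below does this in Python using the two coefficients printed in the Stroke vs Age summary; the function name `stroke_probability` and the example ages are illustrative choices, not part of the original analysis.

```python
import math

# Coefficients copied from the Stroke vs Age summary above.
INTERCEPT = -7.20439617
AGE_SLOPE = 0.07446153


def stroke_probability(age: float) -> float:
    """Fitted probability of stroke at a given age (inverse logit of the linear predictor)."""
    log_odds = INTERCEPT + AGE_SLOPE * age
    return 1.0 / (1.0 + math.exp(-log_odds))


# Each additional year multiplies the odds of stroke by exp(slope).
odds_ratio_per_year = math.exp(AGE_SLOPE)
print(f"odds ratio per year of age: {odds_ratio_per_year:.3f}")

# Illustrative ages chosen for this example.
for age in (40, 60, 80):
    print(f"age {age}: fitted stroke probability {stroke_probability(age):.3f}")
```

The fitted probability stays small through middle age and climbs steeply afterwards, which is the shape of the red curve in the plot.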
Since we have observed that age increases risk, what about BMI?
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.03719414 0.286754961 -14.078899 5.120123e-45
## bmi 0.03716173 0.009281048 4.004045 6.226863e-05
Binomial GLM, logit link, Stroke vs BMI
In the above plot, we see that higher BMI is associated with a higher risk of stroke. We see a similar association with average blood glucose level below:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14304948 0.160825618 -25.761129 2.419073e-146
## avg_glucose_level 0.01020692 0.001132909 9.009477 2.070407e-19
Binomial GLM, logit link, Stroke vs Average Glucose Level
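The three slopes above are not directly comparable because each covariate is measured in different units. One hedged way to compare them is to convert each slope to an odds ratio over a step of a chosen size. In the Python sketch below, the slopes are copied from the three summaries, while the step sizes (10 years, 5 BMI units, 20 mg/dL) are arbitrary choices made for illustration.

```python
import math

# Slopes copied from the three single-covariate summaries above.
slopes = {
    "age": 0.07446153,
    "bmi": 0.03716173,
    "avg_glucose_level": 0.01020692,
}

# Illustrative step sizes (an assumption of this example, not of the report).
steps = {"age": 10, "bmi": 5, "avg_glucose_level": 20}


def odds_ratio(slope: float, step: float) -> float:
    """Multiplicative change in the odds when the covariate increases by `step`."""
    return math.exp(slope * step)


scaled_odds_ratios = {name: odds_ratio(slopes[name], steps[name]) for name in slopes}
for name, value in scaled_odds_ratios.items():
    print(f"{name}: odds x {value:.2f} per +{steps[name]} units")
```

These are unadjusted, single-covariate odds ratios, so they describe association only.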
Now we may move on to the categorical covariates: these may be investigated with a table.
Bar Plot of Stroke Proportion vs Smoking Status
What a strange trend! The prevalence of stroke among those who never smoked and those who smoke is similar, while the prevalence among former smokers is much higher! It is biologically unlikely that smoking and then stopping has a special ability to cause strokes. It is more likely that there is some kind of multiple dependence or sample bias: maybe those who formerly smoked stopped because they experienced a health complication, or those who had the time to smoke and then stop tended to be older. There is no clear way to untangle such bias without sampling more data, so we are left with speculation.
Moving on, we next take a look at hypertension and heart disease. Since the two are closely related medically, I will plot them with their interaction.
##                                        Estimate Std. Error    z value     Pr(>|z|)
## I((!hypertension) * (!heart_disease)) -3.331963 0.08365478 -39.829921 0.000000e+00
## I(hypertension * (!heart_disease))    -1.921352 0.14707262 -13.063970 5.289568e-39
## I((!hypertension) * (heart_disease))  -1.649789 0.18724712  -8.810759 1.243013e-18
## I(hypertension * heart_disease)       -1.366876 0.31069425  -4.399426 1.085378e-05
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.012417 0.07050906 -42.72383 0.000000e+00
## hypertension 1.143021 0.15167784 7.53585 4.851625e-14
Bar plot, Stroke risk by hypertension and heart disease
From the bar plot, we can see that those with both heart disease and hypertension have the highest risk, followed in descending order by heart disease only, hypertension only, and neither. The model reports every group coefficient as significant, although without an intercept each test compares a group’s log-odds against zero rather than against the other groups; the gap between the group with neither condition and the other three is large relative to the standard errors. We can also use a simple GLM to show that those with hypertension have a significantly higher risk of also having heart disease, indicating that these two covariates are indeed related.
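Since the four-group model has no intercept, each coefficient is the log-odds of stroke within its group, and an inverse logit turns it into a fitted risk. The Python sketch below applies that conversion to the four estimates printed above; the group labels and the helper name `inverse_logit` are introduced here for readability.

```python
import math

# Estimates copied from the no-intercept hypertension x heart disease model above.
group_log_odds = {
    "neither": -3.331963,
    "hypertension only": -1.921352,
    "heart disease only": -1.649789,
    "both": -1.366876,
}


def inverse_logit(log_odds: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


group_risk = {group: inverse_logit(value) for group, value in group_log_odds.items()}
for group, risk in group_risk.items():
    print(f"{group}: fitted stroke risk {risk:.3f}")
```

The resulting risks reproduce the ordering seen in the bar plot.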
Now that we have some intuition about what to expect, we can move on to all the covariates in one model.
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.954558530 0.793641363 -8.76284788 1.903785e-18
## genderMale 0.007036884 0.142195582 0.04948736 9.605309e-01
## age 0.075149519 0.005870475 12.80126735 1.612971e-37
## hypertension 0.416767410 0.165174120 2.52320043 1.162921e-02
## heart_disease 0.272296918 0.191117328 1.42476311 1.542257e-01
## ever_marriedYes -0.193136036 0.225784690 -0.85539917 3.923302e-01
## work_typeGovt_job -1.028636493 0.837738674 -1.22787275 2.194947e-01
## work_typePrivate -0.907777857 0.822692135 -1.10342353 2.698433e-01
## work_typeSelf-employed -1.270196189 0.843182619 -1.50643071 1.319566e-01
## Residence_typeUrban 0.087944655 0.138818250 0.63352372 5.263917e-01
## avg_glucose_level 0.003812709 0.001207845 3.15662172 1.596083e-03
## bmi 0.010867746 0.012625631 0.86076856 3.893655e-01
## smoking_statusnever smoked -0.224347912 0.176587566 -1.27046268 2.039199e-01
## smoking_statussmokes 0.111463650 0.215515157 0.51719634 6.050191e-01
## smoking_statusUnknown -0.066745558 0.208598611 -0.31997125 7.489901e-01
In our model, we have some surprising results: gender, heart disease, smoking status, and BMI are no longer significant! This may be due to some form of multicollinearity between them and the significant variables, but we would expect these covariates to be significant regardless of such dependence. Gender, for example, is suggested to be a significant risk factor, but only depending on age. Smoking status was also supposed to be a risk factor, and intuitively should not be completely dependent on the others, as we do not have a variable for lung disease. We also tried multiple variable selection techniques, such as forwards/backwards/bidirectional stepwise selection and LASSO. These did not yield any further insight, and they are relegated to the EDA.r file. So, let’s investigate further.
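As a reading aid for the coefficient table above, the z value is the estimate divided by its standard error, and the p-value is the two-sided tail area of a standard normal at that z. The Python sketch below recomputes both for three rows of the full model; the estimates and standard errors are copied from the table, and the function name `wald_test` is an illustrative choice.

```python
import math


def wald_test(estimate: float, std_error: float) -> tuple[float, float]:
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * P(Z > |z|) for a standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# (estimate, standard error) pairs copied from the full-model summary above.
rows = {
    "hypertension": (0.416767410, 0.165174120),
    "heart_disease": (0.272296918, 0.191117328),
    "bmi": (0.010867746, 0.012625631),
}

results = {name: wald_test(*pair) for name, pair in rows.items()}
for name, (z, p_value) in results.items():
    print(f"{name}: z = {z:.3f}, p = {p_value:.4f}")
```

A covariate can therefore lose significance either because its estimate shrinks or because its standard error grows once correlated covariates enter the model.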
Our next model will be a generalized additive model. This type of model allows for easy semiparametric multivariate modelling because it treats each dimension additively with univariate terms while still supporting multivariate smoothing, such as with tensor product splines. We can then test for nonlinear relationships and for interactions. First, we establish a base model with only the significant covariates from the full model as well as BMI, which we suspect to be relevant:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.794692407 0.529829369 -14.7117032 5.422525e-49
## age 0.070607319 0.005176366 13.6403271 2.305242e-42
## hypertension 0.399041328 0.163079901 2.4469068 1.440881e-02
## avg_glucose_level 0.004176358 0.001184467 3.5259375 4.219866e-04
## bmi 0.008166512 0.012382875 0.6595005 5.095744e-01
For readers unfamiliar with tensor product splines and SS-ANOVA in spline models: special multivariate models called “tensor product splines” can be constructed from the Reproducing Kernel Hilbert Space (RKHS) point of view. The details are beyond the scope of this report, but the punchline is that the model space can be broken up into a collection of orthogonal subspaces. Because these subspaces are orthogonal, an ANOVA can be based on a decomposition of the deviance explained, so that smooth terms in the model can be “significance tested.” The unpenalized subspace, or the parametric part, is allowed to vary completely freely to minimize the objective function. In practice, these are usually very rigid and interpretable subspaces, such as the subspace of linear models. The non-parametric part, or smooth term, is a shape-fitting regression that can adjust to any shape but is prevented from overfitting by a penalty functional. The classical non-parametric regression is the cubic smoothing spline; Gaussian process regressions and regression splines are popular nowadays. The splines implemented by the mgcv package are not “true” RKHS regressions, but rather regression splines with automatically chosen knot points. For sufficiently well-behaved data, these approximate the kernel regressions fitted by smoothing splines with much better computational performance. More complicated smoothing splines, such as arbitrary kernel regressions or semiparametric mixed effect models, would be better fit by a package like gss or assist.
When we fit the GAM below, we include linear terms for age, BMI, hypertension, and average glucose level, followed by smooth terms for age, BMI, and the interaction between age and BMI.
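One way to write the model just described is shown here. The symbols are notation introduced for this sketch rather than taken from the fitting code: the β's are the linear coefficients, and f_1, f_2, and f_12 stand for the smooth terms reported as ti(age), ti(bmi), and ti(age,bmi).

```latex
\operatorname{logit} P(\text{stroke} = 1)
  = \beta_0
  + \beta_1\,\text{age}
  + \beta_2\,\text{bmi}
  + \beta_3\,\text{hypertension}
  + \beta_4\,\text{avg\_glucose\_level}
  + f_1(\text{age})
  + f_2(\text{bmi})
  + f_{12}(\text{age}, \text{bmi})
```

Under this reading, a test on f_12 asks whether the joint effect of age and BMI departs from the sum of their separate effects.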
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.000000000 0.000000000 NaN NaN
## age 0.075000521 0.007332206 10.228916 1.471572e-24
## bmi -0.279685160 0.016791754 -16.656102 2.732646e-62
## hypertension 0.397401027 0.162903010 2.439495 1.470782e-02
## avg_glucose_level 0.004178781 0.001185224 3.525731 4.223152e-04
## edf Ref.df Chi.sq p-value
## ti(age) 0.1596484 0.2862281 0.05878687 0.80842452
## ti(bmi) 3.1837983 3.6391825 102.86146769 0.00000000
## ti(age,bmi) 2.8100385 3.6934406 8.41877968 0.05768213
To read the above summary, we note that BMI’s linear term has become significant and the interaction term “ti(age,bmi)” is marginally significant (p ≈ 0.058). This suggests that there is some kind of interaction between age and BMI in how they predict the risk of stroke. Before we move on, because of the fickle nature of non-parametric models, it is prudent to check with a couple of extra models to be sure it is not a fluke. Below, we fit a bivariate cubic spline, a main-effect adjusted bivariate cubic spline, and a Gaussian process smooth.
## edf Ref.df Chi.sq p-value
## s(age,bmi) 6.034942 8.641374 192.599 0
## edf Ref.df Chi.sq p-value
## ti(age) 2.302803 2.544600 3.611168 0.23902136
## ti(bmi) 3.094177 3.557945 46.475043 0.00000000
## s(age,bmi) 2.997936 27.000000 6.778965 0.01411261
## edf Ref.df Chi.sq p-value
## ti(age) 2.251027 2.488272 3.191503 0.29795986
## ti(bmi) 3.082044 3.551451 7.033040 0.06937026
## s(age,bmi) 2.776575 30.000000 6.707165 0.01364100
Because all three report a significant joint term for age and BMI, I will include the interaction in our model. We can next investigate how exactly the interaction behaves by plotting a slice of it:
It seems from the heatmap and the surface plot that there’s a strange interaction. In younger patients, stroke is less likely when BMI is lower. In old patients (75 years or older), however, the risk is lowest when the BMI is between 35 and 45. Some Google searches suggest that the medical literature is unsettled on the interaction between age, BMI, and stroke. One study, https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/754810, suggests that BMI increases stroke risk when age is not adjusted for. Another, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6719766/, suggests that severity and mortality significantly decrease as BMI increases among old patients. This dataset may be a little too small to draw any very meaningful conclusions about the relationship, but it is interesting to see those hypotheses reflected in an “ideal BMI range” of 35-45 in very old patients.
Now that we’ve adjusted for the strange interaction between BMI and age, we can investigate whether there are any more. Let’s check gender:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.000000000 0.000000000 NaN NaN
## age 0.110992909 0.032850129 3.378766 7.281189e-04
## bmi -0.222468407 0.040154983 -5.540244 3.020504e-08
## hypertension 0.375221261 0.164167528 2.285600 2.227768e-02
## avg_glucose_level 0.005253376 0.001727076 3.041775 2.351878e-03
## genderMale -8.443054850 3.238399259 -2.607169 9.129423e-03
##                                                      edf   Ref.df     Chi.sq   p-value
## s(I((gender == "Female") * bmi))               1.0001185 1.000226  1.4849982 0.2231699
## s(I((gender == "Female") * age))               4.9237130 5.904177  6.9461201 0.2869973
## s(I((gender == "Female") * avg_glucose_level)) 3.1717088 3.940476  5.9136064 0.2035486
## ti(age)                                        0.8668073 1.151921  0.3101391 0.4622605
## ti(bmi)                                        3.4478362 3.797142 48.7405671 0.0000000
## ti(age,bmi)                                    2.7379032 3.765661  7.4422522 0.1192928
There do not seem to be any meaningful interactions between gender and the continuous variables, but the linear effect for gender is now significant. The p-value of the interaction between age and BMI has risen (from about 0.06 to about 0.12), but such a loss of significance is to be expected when we introduce several new degrees of freedom to this kind of model.
While we could investigate the other variables similarly, I will spoil the ending and say that I did not find any other interesting interactions, so we can move on to the next part of the investigation.
Now that we have looked at how our covariates affect the risk of stroke, we can also see if there are any interesting relationships between the covariates themselves. First, let’s look at a t-SNE embedding.
t-Distributed Stochastic Neighbor Embedding of the numerical covariates
I really do not see anything here. Let’s look at the embedding for just the variables age, bmi, smoking_status, gender, hypertension, heart_disease, and avg_glucose_level.
t-Distributed Stochastic Neighbor Embedding of the reduced data set
As far as I can tell, there isn’t an easily interpretable clustering structure in the t-SNE. The embedding of the reduced data seems to have some kind of structure, but I do not know how to read it. Maybe we can see a little more if we try a PCA?
There may be some clustering going on in the plots of the first component against the second or the second component against the third, but it is not very pronounced.
First and Second Principal Components
Finally, let’s try throwing a more powerful model at the problem and asking what it thinks. For this, we will use a gradient boosted tree model (gbm).
Interesting! The gbm model mostly agrees with our previous investigation: age and average glucose level are most important, followed by BMI, heart disease, and hypertension. However, after running the model several times, work type is sometimes used and sometimes not, and the same goes for gender. Now let’s see the F1 score and an ROC curve.
## [1] 0.9502208
## [1] 0.008064516
## [1] 0.007996649
The ROC curve suggests a threshold of about 0.4, which gives us an F1 score of around 0.02, far from ideal. The precision tends to be good; it’s the recall that is very low. However, because of the low incidence rate, the outcome is dominated by zeros, so this poor F1 score is not unexpected.
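To see how high accuracy, good precision, and a very low F1 score can coexist when the outcome is rare, consider the Python sketch below. The confusion-matrix counts are hypothetical and chosen only for illustration; they are not the results of the gbm model, and the function name `classification_metrics` is likewise an assumption of this example.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a rare outcome (50 strokes in 1000 patients);
# these are NOT the report's results.
metrics = classification_metrics(tp=1, fp=0, fn=49, tn=950)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the classifier is right about 95% of the time simply by predicting "no stroke" almost everywhere, while the F1 score stays near zero because recall is tiny.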
To wrap up this project, let’s summarize what we learned: stroke risk increases with age, cardiovascular disease, hypertension, and average blood glucose level (diabetes). Risk also depends on BMI, but higher BMIs are riskier in young (<65 year old) patients, while there seems to be a lowest-risk range of 35-45 BMI in older patients. Gender was marginally significant in our Generalized Additive Model, but was used only some of the time by the Generalized Boosted Model. Strong associations between covariates were found among the categorical variables, particularly between hypertension and cardiovascular disease, but the clustering structure among the continuous covariates was too complicated for me to interpret. Overall, I think the most interesting finding is the complicated interaction between the effects of BMI and age on stroke risk.